On July 1st, 2016, the Chicago Department of Public Health (CDPH) identified an E. coli outbreak linked to a restaurant in the Bridgeport neighborhood of Chicago. Within days, the case count swelled from the initially reported 25 to 65, and the outbreak ultimately resulted in 20 hospitalizations. By the time the dust settled, over 100 people, including 40 food handlers from the restaurant, had fallen ill. The outbreak, one of the worst in the city's recent history, highlighted the critical work of the city's restaurant safety inspectors: their report of the July 1st inspection identified a critical failure, revealing "improper temperatures for several food items". Prompt identification mitigated the effects of an outbreak that could otherwise have been much worse.
This application presents a tool that spatially predicts where inspections will be needed by forecasting the number of restaurant inspection failures (and thus the number of additional inspections required) in Chicago for a given year.
The City of Chicago employs roughly three dozen restaurant food inspectors. With over 15,000 inspections taking place in Chicago in 2019 alone, the sheer volume of this task necessitates prediction-aided planning. Our application allows for greater precision in anticipating possible inspection failures using machine-learning algorithms. By adopting it, the City of Chicago shortens the feedback loop in identifying critical failures, helping prevent future outbreaks. In addition to improving public health outcomes, the application improves the allocation of a limited number of inspectors, yielding greater efficiency. A built-in scheduling algorithm prioritizes inspections in high-risk areas, with mechanisms for revising inspector schedules as new data arrive. Certain risk factors are known to significantly increase the chance of a failure, and we incorporate these risks in real time.
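As a minimal sketch of the kind of risk-based prioritization such a scheduler performs (the data frame, counts, and inspector pool below are hypothetical illustrations, not the application's actual scheduling logic):

```r
# Hypothetical prioritization sketch: rank fishnet cells by their
# predicted failure counts and send the limited inspector pool to
# the highest-risk cells first. All values are illustrative.
cells <- data.frame(
  cellID = 1:6,
  predicted_failures = c(0, 3, 1, 7, 2, 5)
)
inspectors_available <- 3
priority <- cells[order(-cells$predicted_failures), ]
top_cells <- head(priority$cellID, inspectors_available)
top_cells  # cells 4, 6, and 2 are visited first
```

The same ranking idea extends naturally to re-sorting the queue whenever new risk-factor data arrive.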
# Create a map containing boundary of Chicago.
chicagoBoundary <-
st_read(file.path(root.dir, "Chapter5", "chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
# Create a fishnet of 500m-by-500m square cells over the boundary of Chicago.
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # clip the grid to the city boundary
st_sf() %>%
mutate(uniqueID = rownames(.))
## Neighborhoods to use
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
Food <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5")
FoodInspectAll <- Food %>%
mutate(year = substr(inspection_date,1,4)) %>%
filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Food_Inspect_All")
For anyone wishing to replicate this analysis, we emphasize the importance of visually understanding the spatial distribution of general restaurant inspection failure counts. It is also important to know the spatial layout of independent variables—the factors we use to predict Chicago’s inspection failures. This section provides an overview of our exploratory analysis.
First, we take a look at Chicago's 2018 inspection failures. There were tens of thousands of inspections in 2018, with just under 20% of them ending in failure. In our case example, we predict restaurant failures for Chicago in 2019 using 2018 inspection data, so our exploratory analysis considers data from 2018. A simple plot of restaurant inspection failures by location in Chicago is displayed below.
FoodInspect <- Food %>%
mutate(year = substr(inspection_date,1,4)) %>%
filter(year == "2018") %>%
filter(results=="Fail") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Food_Inspect")
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
labs(title= "Failed Food Inspections, Chicago - 2018") +
mapTheme(title_size = 14)
Several clusters immediately stand out. The densest cluster, to the east, is the Loop. There are also elevated numbers of failed inspections along the North Shore and to the west in Lawndale. This spatial clustering of failed inspections is a good sign that spatial analysis is appropriate for predicting them.
One of the most important spatial factors in predicting inspection failures is restaurant location; there cannot be a failure without a restaurant to inspect in the first place. When examining all restaurant inspection locations, we map density as a layer this time rather than plotting each establishment individually. This avoids overlapping points obscuring restaurants in close proximity, which would be a larger problem given the size of the all-restaurants dataset. Displayed below are side-by-side density maps of all restaurants logged for inspection in 2018 and of restaurant inspection failures for that year.
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
stat_density2d(data = data.frame(st_coordinates(FoodInspectAll)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Food Inspections Density") +
mapTheme(title_size = 14) + theme(legend.position = "none"),
ggplot() +
geom_sf(data = chicagoBoundary, fill = "grey80") +
stat_density2d(data = data.frame(st_coordinates(FoodInspect)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis() +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Failed Food Inspection Density") +
mapTheme(title_size = 14) + theme(legend.position = "none")
)
Brighter green represents higher density on the maps above. The maps are very similar, a testament to the importance of restaurant locations as a predictive factor. Visually comparing the two, however, there appear to be some areas with higher concentrations of failures relative to total inspections (specifically along the North Shore and in Lawndale). These patterns support the use of this application: the differences between the maps show that assigning inspectors purely on the basis of restaurant density would not be the optimal way for the Chicago Health Department to allocate its resources. Some areas are more likely to have inspection-failing restaurants than others despite similar restaurant counts. Because each failure triggers an additional inspection, this report's predictive tool seeks to determine where failures are likely to occur.
This tool aims to take data from one year and predict failed food inspections for the following year, so it is useful to compare the locations of two annual sets of food inspections side-by-side. The maps below illustrate locations of failed inspections in 2017 and 2018.
## Get 2017 data
FoodInspect17 <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2017") %>%
filter(results=="Fail") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Food_Inspect")
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = FoodInspect17, colour="red", size=0.1, show.legend = "point") +
labs(title= "2017 Failed Food Inspections") +
mapTheme(title_size = 14),
ggplot() +
geom_sf(data = chicagoBoundary) +
geom_sf(data = FoodInspect, colour="red", size=0.1, show.legend = "point") +
labs(title= "2018 Failed Food Inspections") +
mapTheme(title_size = 14)
)
Failed inspections for 2017 and 2018 look very similar, supporting the notion that one year's failed inspections can be used to predict the next year's. Some of the differences can be attributed to the fact that a slightly different set of restaurants is inspected each year. For example, food inspections in 2019 covered 387 additional restaurants compared to 2018.
An ideal spatial process for our model must account for factors that operate at a local level, recognizing that factors affecting one restaurant are very likely to affect those nearby. Additionally, rather than predicting outcomes of individual scheduled inspections restaurant by restaurant, the spatial process should accommodate the fact that new establishments are added to the database every year.
Predicting on a fishnet grid, such that Chicago is divided into equally sized square cells, accomplishes this. The model uses the previous year's inspections and other risk factors to predict the number of failed inspections in each cell. The map below displays the actual count of 2018 restaurant failures per grid cell. For the purposes of this model, each cell is 500 meters on a side, a length equivalent to about 2.5 Chicago city blocks.
## Add a value of 1 for each failure, sum them with aggregate
inspect_net <-
dplyr::select(FoodInspect) %>%
mutate(countInspect = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countInspect = replace_na(countInspect, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
# Plot the fishnet displaying the count of inspection failures
ggplot() +
geom_sf(data = inspect_net, aes(fill = countInspect), color = NA) +
scale_fill_viridis() +
labs(title = "Count of Inspection Failures for the Fishnet") +
mapTheme()
# Create a histogram for exploration of expected prediction distribution
HistogramInspect <-
inspect_net %>%
group_by(uniqueID) %>%
summarize(countInspect_sum = sum(countInspect, na.rm = T),
countInspect_mean = mean(countInspect, na.rm = T)) %>%
ungroup()
# plot histogram of failures
HistogramInspect %>%
ggplot(aes(countInspect_sum)) +
geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
scale_x_continuous(breaks = seq(0, 100, by = 5)) +
labs(title="Distribution of Failed Inspection Counts", subtitle = "Chicago, IL",
x="Number of Failed Inspection Counts", y="Number of Cells")
The histogram above illustrates the distribution of cells according to their actual number of failed inspections for 2018. Of the 2,500 cells that comprise Chicago's area, just over half recorded no failed inspections. The vast majority of cells that did have failed inspections had fewer than 10, though a few individual cells had 30 or more. When the 2018 model is used to predict failures by fishnet cell for 2019, the predictions should follow a similar distribution, as the spatial process's estimates will reflect the clustered nature of restaurant failures in Chicago.
As stated in the introduction, the goal of the model is to spatially predict failed inspections so that an estimate of the total number of inspections can be used to allocate inspectors throughout Chicago more efficiently. To predict inspection failures for a target year, we use a generalized linear model. Specifically, the model fits a function which, for each cell in the fishnet, takes a set of independent variables and returns a predicted number of restaurant inspection failures. The predictions used as forecasts of failed inspections for each cell are outputs of k-fold cross-validation. This is explored in more detail in the next section, but the basic idea is that the predictions created by cross-validating one year's data are used as predictions for the ensuing year.
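The count model described above can be sketched in miniature. This toy example uses simulated data with a single hypothetical risk predictor, not the actual fishnet:

```r
# Simulate failure counts driven by one risk factor, fit a Poisson
# GLM (a generalized linear model for counts), and recover the
# expected number of failures per observation.
set.seed(42)
toy <- data.frame(risk = runif(200, 0, 2))
toy$failures <- rpois(200, lambda = exp(0.5 + 1.0 * toy$risk))
fit <- glm(failures ~ risk, family = "poisson", data = toy)
# type = "response" returns predictions on the count scale
toy$predicted <- predict(fit, toy, type = "response")
```

The fitted coefficient on `risk` should land near the simulated value of 1.0, and the response-scale predictions are always positive, as counts must be.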
This process, however, is only possible with a set of independent variables whose spatial variation collectively predicts restaurant inspection failures well. Each independent variable, or "risk factor", is explored in more detail in the subsections below.
Inspection Locations - As described in the previous section, inspection locations are vital predictors of the next year’s failure locations. In order for a restaurant to fail an inspection, the inspection must take place at a location, and chances are that location is contained within the set of the previous year’s inspection locations. Like the rest of the risk factors, inspection locations are used in their fishnet form.
# Food establishment locations, 2018
foodInspectLocations <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
distinct() %>%
mutate(Legend = "Food_Inspect_Locations")
Inspection Failure Locations (3NN) - Inspection failure locations cannot be added as a risk factor outright, as the k-fold cross-validation in the next section seeks to optimize this exact variable. If inspection failure locations were included as-is, cross-validation would heavily overfit its predictions to this variable, giving minimal consideration to the others. The model would be more "accurate" in replicating the locations of inspection failures, but less generalizable to the following year and less useful as a predictive tool. Instead, we introduce an independent variable reflecting the mean distance from any point in Chicago to the three nearest inspection failure points. Averaged by fishnet cell, this yields a factor that helps predict inspection failure locations without causing the cross-validation to overfit.
# Food inspection failures, 2018
# 3NN to be performed on this risk factor later
foodInspectFail <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2018") %>%
filter(results=="Fail") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Food_Inspect_Fail")
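The `nn_function()` used later in this document comes from the book's helper utilities and is not defined here; a simplified base-R stand-in for the k = 3 mean-distance idea looks like this (illustrative only):

```r
# Mean Euclidean distance from each 'from' point to its k nearest
# 'to' points; a toy stand-in for nn_function(st_c(...), ..., k = 3).
mean_nn_dist <- function(from, to, k = 3) {
  apply(from, 1, function(pt) {
    d <- sqrt((to[, 1] - pt[1])^2 + (to[, 2] - pt[2])^2)
    mean(sort(d)[seq_len(k)])
  })
}
# One query point at the origin; failures at distances 1, 2, 3, and 10
from <- matrix(c(0, 0), ncol = 2)
to   <- rbind(c(1, 0), c(2, 0), c(3, 0), c(10, 0))
mean_nn_dist(from, to)  # mean of 1, 2, 3 = 2
```

Note how the distant fourth point (at distance 10) does not influence the result: averaging only the k nearest failures makes the feature a smooth, local measure of exposure.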
The following risk factors are considered to be potential predictors of Failed Food Inspections, and are accessed from the Chicago Open Data Portal:
Rodent Complaints (3NN) - This variable records location-based 311 service requests for rodent baiting over a year. There were 39,023 instances in 2018.
rodentBait <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Rodent_Bait")
Sanitation (3NN) - This variable records location-based 311 service requests associated with sanitation code complaints. There were 3,316 instances in 2018.
sanitation311 <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2018") %>%
filter(what_is_the_nature_of_this_code_violation_ %in%
c("Garbage in Yard","Garbage in Alley","Dumpster not being emptied", "Overflowing carts", "Construction Site Cleanliness/Fence", "Standing water")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation_311")
Ordinance Violation (3NN) - This variable refers to reported Department of Buildings ordinance violations in Chicago; 43,243 were reported in 2018.
ordinanceViolation <-
read.socrata("https://data.cityofchicago.org/Administration-Finance/Ordinance-Violations-Buildings-/awqx-tuwv") %>%
mutate(year = substr(VIOLATION.DATE,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = LATITUDE, X = LONGITUDE) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Ordinance_Violations")
Graffiti (3NN) - This records all 311 service requests for the removal of graffiti in Chicago. There were 55,873 of these requests in Chicago in 2018.
graffiti <-
read.socrata(paste0("https://data.cityofchicago.org/Service",
"-Requests/311-Service-Requests-Graffiti-Removal-Historical/",
"hec5-y4x5")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2018") %>%
filter(where_is_the_graffiti_located_ %in%
c("Front","Rear","Side")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
Liquor Retail (3NN) - This variable derives the locations of liquor stores from business license data on the Chicago Open Data Portal.
liquorRetail <-
read.socrata(paste0("https://data.cityofchicago.org/resource/",
"nrmj-3kcf.json")) %>%
filter(business_activity ==
"Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
All five of these risk factors use three-nearest-neighbor analysis. While using these data in their raw form would not break the model's generalizability, the three-nearest-neighbor versions tended to perform slightly better in the model. The plot below combines the five new risk factors into a single figure. Note that lighter colors correspond to cells farther from these factors, meaning that cells with darker colors have a higher likelihood of failed inspections being predicted in them.
# Bind the risk-factor point sets and aggregate their counts to the fishnet
vars_net <- rbind(rodentBait, sanitation311, ordinanceViolation, graffiti, liquorRetail, foodInspectFail, foodInspectLocations) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet, by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
# Create long version
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
# Create list
vars <- unique(vars_net.long$Variable)
mapList1 <- list()
# Convenience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
## create NN from rodent bait
vars_net <- vars_net %>%
mutate(rodent_Bait.nn = nn_function(st_c(st_coid(vars_net)),
st_c(rodentBait),
k = 3))
## create NN from sanitation 311
vars_net <- vars_net %>%
mutate(sanitation_311.nn = nn_function(st_c(st_coid(vars_net)),
st_c(sanitation311),
k = 3))
## create NN from ordinance violations
vars_net <- vars_net %>%
mutate(ordinance_Violation.nn = nn_function(st_c(st_coid(vars_net)),
st_c(ordinanceViolation),
k = 3))
## create NN from graffiti
vars_net <- vars_net %>%
mutate(graffiti.nn = nn_function(st_c(st_coid(vars_net)),
st_c(graffiti),
k = 3))
## create NN from liquorRetail
vars_net <- vars_net %>%
mutate(liquor_Retail.nn = nn_function(st_c(st_coid(vars_net)),
st_c(liquorRetail),
k = 3))
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList2 <- list()
for(i in vars){
mapList2[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i),
aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList2, ncol = 3,
top = "Nearest Neighbor Risk Factors by Fishnet"))
# create NN from food_Inspect_Fail
# This comes after the others so that it does not appear in the group of plots
vars_net <- vars_net %>%
mutate(food_Inspect_Fail.nn = nn_function(st_c(st_coid(vars_net)),
st_c(foodInspectFail),
k = 3))
# Gather all .nn vars as vars_net.long.nn
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
We'll likewise consider the distance to the Loop, Chicago's central business district, which the earlier maps suggest is significant. See the map below.
#Distance to Loop to be used later
loopPoint <-
filter(neighborhoods, name == "Loop") %>%
st_centroid()
vars_net$loopDistance =
st_distance(st_centroid(vars_net), loopPoint) %>%
as.numeric()
ggplot() +
geom_sf(data = vars_net, aes(fill=loopDistance), colour=NA) +
scale_fill_viridis(name="loopDistance") +
labs(title="Distance to the Loop") +
mapTheme()
## important to drop the geometry from joining features
final_net <-
left_join(inspect_net, st_drop_geometry(vars_net), by="uniqueID")
We use a spatial join to attach fishnet cell centroids to the neighborhood polygons. The map below shows the resulting fishnet:
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
# for live demo
mapview::mapview(final_net, zcol = "name")
## generates warnings from PROJ issues
## use {spdep} to convert the fishnet polygons to a neighbors list...
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and the neighbors list to a list of spatial weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
# print(final_net.weights, zero.policy=TRUE)
The grid maps below show four elements: the count of failed inspections, the local Moran's I statistic, its p-value, and significant hotspots (cells with p <= 0.001).
## see ?localmoran
local_morans <- localmoran(final_net$countInspect, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# join local Moran's I results to fishnet
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(countInspect = countInspect,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList2 <- list()
for(i in vars){
varList2[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme(title_size = 14) + theme(legend.position="bottom")}
do.call(grid.arrange, c(varList2, ncol = 4, top = "Local Moran's I statistics, Failed Inspections"))
Once again, we’ll use the nearest neighbor function to find the distance to a hot spot location.
# generates warning from NN
final_net <- final_net %>%
mutate(Inspect.isSig =
ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(Inspect.isSig.dist =
nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net,
Inspect.isSig == 1))),
k = 1))
ggplot() +
geom_sf(data = final_net, aes(fill=Inspect.isSig.dist), colour=NA) +
scale_fill_viridis(name="Inspect.isSig.dist") +
labs(title="NN Distance to Highly Significant \nFailed Inspect Clusters") +
mapTheme()
The plots below show linear correlations between various features of risk factors and failed inspections in Chicago.
correlation.long <-
st_drop_geometry(final_net) %>%
dplyr::select(-uniqueID, -cvID, -name,-Sanitation_311, -Ordinance_Violations, -Graffiti, -Liquor_Retail, -Rodent_Bait, -Food_Inspect_Fail, -Inspect.isSig.dist ) %>%
gather(Variable, Value, -countInspect)
# colnames(final_net)
# , foodInspectLocations, -Inspect.isSig,
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, countInspect, use = "complete.obs"))
ggplot(correlation.long, aes(Value, countInspect)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor,
aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Failed inspection count as a function of risk factors") +
plotTheme()
The histogram below provides a view of the frequency of failed inspections by neighborhood.
The model below uses all seven risk factors, most as nearest-neighbor features, along with spatial-process features marking where failed inspections cluster significantly, as independent variables. The count of failed inspections is the dependent variable. Under spatial cross-validation, the model holds out each neighborhood in turn, trains on the remaining cells, and generates predictions for the hold-out.
We generate four variations of the model: risk factors with and without the spatial-process features, each under both random k-fold and spatial (leave-one-group-out) cross-validation.
# necessary workaround for hardcoded "countBurglaries"
# redefine function
remove(crossValidate)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
#cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(paste0(dependentVariable,"~."), family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
# View(crossValidate)
## define independent variables, including spatial process features
reg.ss.vars <- c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "Food_Inspect_Locations", "Inspect.isSig", "Inspect.isSig.dist", "loopDistance")
## define independent variables, risk factors only
reg.vars <-
c("rodent_Bait.nn", "sanitation_311.nn","ordinance_Violation.nn", "graffiti.nn", "liquor_Retail.nn", "food_Inspect_Fail.nn", "Food_Inspect_Locations")
## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
dataset = dplyr::rename(final_net, countFailures = countInspect),
id = "name",
dependentVariable = "countFailures",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countFailures, Prediction, geometry)
reg.cv <- crossValidate(
dataset = dplyr::rename(final_net, countFailures = countInspect),
id = "cvID",
dependentVariable = "countFailures",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countFailures, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = dplyr::rename(final_net, countFailures = countInspect),
id = "cvID",
dependentVariable = "countFailures",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countFailures, Prediction, geometry)
reg.spatialCV <- crossValidate(
dataset = dplyr::rename(final_net, countFailures = countInspect),
id = "name",
dependentVariable = "countFailures",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countFailures, Prediction, geometry)
Model errors are plotted in a histogram below. The mode of the mean absolute errors is about 2 failed inspections. The model could benefit from further refinement given the presence of neighborhoods with substantial errors.
The most serious errors occur in neighborhoods near the Loop and the…
Comparing mean errors between models shows that the spatial-process features improve our models, particularly under spatial cross-validation, compared to using the risk factors alone. The comparison maps confirm this finding.
reg.summary <-rbind(
mutate(reg.cv,
Error = Prediction - countFailures,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv,
Error = Prediction - countFailures,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV,
Error = Prediction - countFailures,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV,
Error = Prediction - countFailures,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
# calculate errors by NEIGHBORHOOD
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countFailures, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
error_by_reg_and_fold %>%
arrange(desc(MAE))
## Simple feature collection with 396 features and 5 fields
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: 341164.1 ymin: 552875.4 xmax: 367164.1 ymax: 594875.4
## projected CRS: NAD_1983_HARN_StatePlane_Illinois_East_FIPS_1201
## # A tibble: 396 x 6
## Regression cvID Mean_Error MAE SD_MAE geometry
## <chr> <chr> <dbl> <dbl> <dbl> <POLYGON [m]>
## 1 Spatial LOGO-… Rush … 7.80 7.80 7.80 ((358664.1 581375.4, 358664.1 …
## 2 Spatial LOGO-… Rush … 6.90 6.90 6.90 ((358664.1 581375.4, 358664.1 …
## 3 Spatial LOGO-… River… 4.97 4.97 4.97 ((358164.1 579875.4, 357664.1 …
## 4 Spatial LOGO-… Ander… -4.90 4.90 4.90 ((355164.1 589875.4, 354664.1 …
## 5 Spatial LOGO-… Ander… -4.47 4.47 4.47 ((355164.1 589875.4, 354664.1 …
## 6 Spatial LOGO-… River… 3.86 3.86 3.86 ((358164.1 579875.4, 357664.1 …
## 7 Spatial LOGO-… Wrigl… -3.55 3.55 3.55 ((356164.1 586875.4, 356164.1 …
## 8 Spatial LOGO-… Mille… -3.12 3.12 3.12 ((359164.1 578875.4, 358664.1 …
## 9 Spatial LOGO-… Mille… -3.09 3.09 3.09 ((359164.1 578875.4, 358664.1 …
## 10 Spatial LOGO-… Greek… 2.89 2.89 2.89 ((357164.1 578875.4, 357164.1 …
## # … with 386 more rows
# error_by_reg_and_fold %>%
# arrange(MAE)
## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill="#FDE725FF") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) + scale_x_continuous(
breaks = seq(0, 8, by = 1)) +
labs(title="Distribution of MAE",
subtitle = "k-fold cross-validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count") +
plotTheme()
## Errors by Model
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(2, color = "black", background = "#FDE725FF") %>%
row_spec(4, color = "black", background = "#FDE725FF")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.29 | 0.28 |
| Random k-fold CV: Spatial Process | 0.28 | 0.26 |
| Spatial LOGO-CV: Just Risk Factors | 0.59 | 1.13 |
| Spatial LOGO-CV: Spatial Process | 0.52 | 1.06 |
## map of errors by neighborhood
error_by_reg_and_fold %>%
filter(str_detect(Regression, "LOGO")) %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis() +
labs(title = "Failed inspection errors by LOGO-CV Regression") +
mapTheme() + theme(legend.position="bottom")
## neighborhood weights
neighborhood.weights <-
filter(error_by_reg_and_fold,
Regression == "Spatial LOGO-CV: Spatial Process") %>%
group_by(cvID) %>%
poly2nb(as_Spatial(.), queen=TRUE) %>%
nb2listw(., style="W", zero.policy=TRUE)
filter(error_by_reg_and_fold, str_detect(Regression, "LOGO")) %>%
st_drop_geometry() %>%
group_by(Regression) %>%
summarize(Morans_I =
moran.mc(abs(Mean_Error), neighborhood.weights,
nsim = 999, zero.policy = TRUE,
na.action=na.omit)[[1]],
p_value =
moran.mc(abs(Mean_Error), neighborhood.weights,
nsim = 999, zero.policy = TRUE,
na.action=na.omit)[[3]])
## # A tibble: 2 x 3
## Regression Morans_I p_value
## * <chr> <dbl> <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors 0.141 0.02
## 2 Spatial LOGO-CV: Spatial Process 0.122 0.028
Next, we'll compute kernel density with three different search radii: 1,000, 1,500, and 2,000 feet. The maps below visualize all three.
# demo of kernel width
inspect_ppp <- as.ppp(st_coordinates(FoodInspect), W = st_bbox(final_net))
inspect_KD.1000 <- spatstat.core::density.ppp(inspect_ppp, 1000)
inspect_KD.1500 <- spatstat.core::density.ppp(inspect_ppp, 1500)
inspect_KD.2000 <- spatstat.core::density.ppp(inspect_ppp, 2000)
inspect_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(inspect_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft."))
inspect_KD.df$Legend <- factor(inspect_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data=inspect_KD.df, aes(x=x, y=y)) +
geom_raster(aes(fill=layer)) +
facet_wrap(~Legend) +
coord_sf(crs=st_crs(final_net)) +
scale_fill_viridis(name="Density") +
labs(title = "Kernel density with 3 different search radii") +
mapTheme(title_size = 14)
The kernel density surface can also be aggregated to the fishnet, shown below overlaid with a random sample of 1,500 failed-inspection points.
as.data.frame(inspect_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean) %>%
ggplot() +
geom_sf(aes(fill=value)) +
geom_sf(data = sample_n(FoodInspect, 1500), size = .5) +
scale_fill_viridis(name = "Density") +
labs(title = "Kernel density of 2018 failed inspections") +
mapTheme(title_size = 14)
To evaluate the generalizability of our model, we'll apply its predictions and compare the results to 2019 failed-inspection statistics.
FoodInspect19 <-
read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") %>%
mutate(year = substr(inspection_date,1,4)) %>% filter(year == "2019") %>%
# filter(results=="Fail") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Food_Inspect")
ACS data allows us to evaluate generalizability by race contexts.
One refinement would be to run the model for 2019 using 2018's risk factors but substituting 2019's inspection locations, since the set of restaurants that need inspecting is presumably known in advance. This would mean running the glm() function outside of the cross-validation routine.